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Abstract 


Pan and Reif have shown that Newton iteration may be used to compute the inverse of an 
nxn, well-conditioned matrix in parallel time OQog^i) and that this computation is processor 
efficient Since the algorithm essentially amounts to a sequence of matrix-matrix multiplications, 
it can be implemented with great efficiency on systolic arrays and parallel computers. 

Newton’s method is expensive in terms of the arithmetic operation count. In this paper we 
reduce the cost of Newton’s method with several new acceleration procedures. We obtain a 
speedup by a factor of two for arbitrary input matrices; for symmetric positive definite matrices, 
the factor is four. We also show that the accelerated procedure is a form of Tchebychev accelera- 
tion, while Newton’s method uses instead a Neumann series approximation. 
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In addition, we develop Newton-like procedures for a number of important related prob- 
lems. We show how to compute the nearest matrices of lower rank to a given matrix A, the gen- 
eralized inverses of these nearby matrices, their ranks (as a function of their distances from A), 
and projections onto subspaces spanned by singular vectors; such computations are important in 
signal processing applications. Furthermore, we show that the numerical instability of Newton’s 
method when applied to a singular matrix is absent from these improved methods. Finally, we 
show how to use these tools to devise new polylog time parallel algorithms for the singular value 
decomposition. 

Key Words: Matrix computation, Moore-Penrose generalized inverse, singular value 
decomposition, Newton iteration. 

Acknowledgements. We would like to thank the referees for many valuable suggestions, 
especially for helping to simplify and clarify the proofs; Roland Freund and Youcef Saad for sug- 
gesting the Tchebychev analysis; and Sally Goodall for expertly typing the manuscript. 

1. Introduction 

Pan and Reif [11], [12] have shown that Newton iteration may be used to compute the 
inverse of an nxn, well-conditioned matrix in parallel time proportional to log^ using a number 
of processors that is within a factor of log n of the optimum. Newton iteration is simple to 
describe and to analyze, and is strongly numerically stable for nonsingular input matrices; this is 
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not true of earlier polylog time matrix inversion methods. Moreover, since it is rich in matrix- 
matrix multiplications, it can be implemented with great efficiency on systolic arrays and parallel 
computers. 1 

Unfortunately, the computation can be rather expensive. It has been shown that the number 
of matrix multiplications required can be as high as 4 log k(A), where k(A) e I I A I 1 2 I I A -1 1 1 2 . 

Here we show how to reduce this cost, in some cases quite dramatically. We accelerate and 
extend the usual Newton iteration in several ways. In particular, we 

• substantially accelerate the convergence of the iteration; 

• insure its stability even when A is singular, 

• show how to compute the matrix A(e) obtained from A by setting to zero any of its singular 
values that are less than e. Thus A(e) is a closest lower rank approximation to A (see 
Theorem 2.1 below); 

• compute A + (e), the Moore-Penrose generalized inverse (also called the pseudo-inverse) of 
A(e); 

• compute the rank of A(e); 

1 (On the Connection Machine [7] matrix products may be computed at 5x10^ operations per second, but 
matrix computations done in standard languages rarely can exceed 10* operations per second. Furthermore, 
computer manufacturers are currently being encouraged to provide the fastest matrix product software possi- 
ble, since other matrix computations (QR and LU decomposition, in particular) may be computed with algo- 
rithms that are rich in matrix multiply [4].) 
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• compute the matrix P(e) that projects orthogonally onto the range of A(e). 

Concerning acceleration, we obtain a twofold speedup by scaling the iterates at each step, 
and we prove that the scaled iterates are defined by Tchebychev polynomials that are, in the usual 
minimax sense, optimal in a subspace of polynomials in A T A. In the case of symmetric positive 
definite matrices, we get another speedup by a factor of two through a new means of constructing 
the initial iterate. We also consider adaptive procedures for which we prove no such a priori 
lower bound on speedup, but which promise to be useful in practice. Our results have further 
applications, in particular, to signal processing and to computing the SVD of a matrix. 

Concerning efficiency in highly parallel computing environments, we do not claim that the 
present methods are always advantageous. The alternative, for most computations discussed 
here, is a parallel computation of the full SVD of A. The computation of the SVD in a highly 

parallel environment is best done using a square array of -JJ-Xy processors, implementing a 

Jacobi-like method due to Kogbetliantz [3]. (This method is not competitive with standard 
methods in terms of operation count, but the standard methods are not well suited to highly paral- 
lel architectures). Good experimental evidence is available to show that from six to ten sweeps of 
the Kogbetliantz method are needed. For real A, each sweep requires 8n 3 multiplications. The 
sequential operation count of this method is therefore the same as that of 32 — 52 Newton itera- 
tions. Thus, Newton’s method is competitive with a Kogbetliantz SVD for these problems on 
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highly parallel computers. It is a clear winner if the condition number of A(e) is not too large, or 
if the adaptive acceleration we employ is especially successful, so that far fewer than 30 Newton 
steps are needed, or if we can exploit sparsity or other properties of A to reduce its cost Further- 
more, it is often the case on parallel machines that matrix products, the core of the Newton itera- 
tion, can be computed especially fast. (On the Connection Machine, microcoded matrix multiply 
runs an order of magnitude faster than code written in Fortran, C*, or *LISP.) 

Of course, other parallel methods for the SVD may arise. And when the number of proces- 
sors is not large, so that 0(n 2 ) processors are not available, then more modestly parallel, but less 
costly methods for the SVD are better. 

1.1 Contents. 

We organize the paper as follows: Section 2 is for definitions. In Section 3, we recall the 
customary Newton iteration for matrix inversion; in Section 4 we present a Tchebychev accelera- 
tion procedure and in Section 5 an adaptive acceleration using cubic polynomials rather than the 
quadratics used in Newton’s method. We give a method for finding an improved initial iterate for 
symmetric positive definite A in Section 6. In Sections 7-9 we give methods for computing the 
matrices A(e) and A + (e) (see above), prove the stability of these computations, show how to find 
subspaces spanned by singular vectors, and show some further applications. Some numerical 


experiments are presented in Section 10. 
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2. Notation 

We denote most matrices by upper case Roman letters, diagonal matrices by upper case 
Greek letters, and the elements of a matrix by the corresponding lower case Greek letter. We also 
use lower case Greek letters for various scalars, lower case bold Roman letters for vectors and in 
particular, columns of matrices. The letters i, j, k, m, n, r, and s are used for integers. We assume 
that all the quantities are real; extension to the complex case is straightforward. 

The basis for our analysis is the singular value decomposition (hereafter referred to as the 
SVD) of A g R mxn , 

A = U2V T , (2.1) 

where U = [ui, ,u m ] and V = [v 1( ■ • • ,v„] are square orthogonal matrices and 

I = diag(Oi > 02 £ • • ■ £ o r > o r+ i = • • • = <j„ = 0). Here, r = rank(A). The generalized inverse 

of A is 

A + = V2 + U t 

where 

2 + sdiag(af , ,02 1 , .or’.O, ••• ,0). 

If A is symmetric and positive definite, then m = n and U = V. In this case, the eigenvalues 
of A are Oj and the eigenvectors are vj, 1 < j < n. 

We shall use the Euclidean vector norm 1 1 x I 1 2s(x T x) 1/2 and the following matrix norms: 


I I A I 1 2 = maj I I Ax I 1 2 / M x I 1 2 , 
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1 1 A 1 1 1 s max £ I otjj I , 

J 1 

1 1 A I I o- s max £ I Oij I . 

1 j 

f ^ 

__ tm 1 /2 

IIAIIp H Sett] . 
jj 

V ^ 

It is known [6] that if A has the SVD (2.1) then I I A I 1 2 = CTj, and 
I IAI I f = I 121 I f = (£j Oj 2 ) 1/2 . The following theorem (the Eckart- Young Theorem) can be 
found in [6]. p. 19. 

Theorem 2.1. Let A. be a matrix of rank r with SVD (2.1). Let s<ibean integer, and let 

A » s ^ UjOjVj T . 

Then 

1 IA_A ‘ 1 ,2 = JS^. s 1 IA_BI l2 = o.+i ■ 

We shall make use, too, of the spectral condition number k(A) of A, which is defined by 

k(A) b I IAI 1 2 1 1 A + 1 1 2 = Oi/ ct r . (2.2) 

The chief reason for being interested in the generalized inverse is that the solution to the least 

squares problem 

min I I Ax - b I 1 2 

having smallest norm I I x I 1 2 is A + b. 

3. Newton iteration 
Newton iteration 

X k+l = X k (2I-AX k ) = (2I-X k A)X k (3.1) 

was proposed by Schulz in 1933 for computing the inverse of a nonsingular matrix A [16]. Much 
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later, Ben-Israel and Cohen proved that the iteration converges to A + provided that 

Xo = otoA T , (3.2) 

with oto positive and sufficiently small [1], [2], 

All the following discussion employs an SVD-based analysis of the method, first developed 
by Soderstrom and Stewart [17], who observed that (2.1), (3.1) and (3.2) imply that each iterate 
has an SVD of the form 


x k = vn k u T 

with the matrices U and V of (2.1). The products X k A therefore satisfy 


X k A = VR k V T (3.3) 

where R k = E k Z * diag(pf k) ,p^ t \ • • ■ ,p n W); moreover, for all 1 £ j £ n and all k > 0, 

1 - p/k+n = (p/k) - 1)2. (3.4) 

dearly, p/ 0) = aooj 2 . For any oto< 2/ of, (3.4) implies the quadratic convergence of to one 

for all j, 1 £ j < r, and, therefore, of X k to A + . The optimum choice of oto in (3.2), which minim- 
izes 1 1 1 - XqA I 1 2 by making pf°> -1 = 1- p/°), is 


2 

o? + o r 2 

Let k(A) be given by (2.2). Then with the choice (3.5), 


(3.5) 


P ' <0)= TTW' <36) 

so that p/ 0) c 1 if k(A) is large. In practice, information about a r is hard to get. We may instead 


use a suboptimal but nevertheless safe alternative, such as 
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“° I I A I I /l I A I I M ‘ (3 ' 7) 

Other choices of ao which do not require an estimate of o r are available [1], [2], [11], [12]. 

It follows from (3.4) that small singular values p approximately double at every step. 
Therefore, it takes about 2 log 2 K(A) steps of (3.1) to get to the point where pf® > Vi and an addi- 
tional log log (1/e) iterations for convergence of all the pj to within e of 1. Thus, if the floating- 
point operations are the measure of cost, the method is more expensive than the conventional 
alternatives. The reason for our interest in this method is that (3.1) essentially amounts to two 
matrix multiplications, which can be very efficiently implemented on systolic arrays and on vec- 
tor and parallel computers ([13], [14]). Further savings are possible: if A is sparse, then X k A is 
less costly to compute: in the case of Toeplitz and many other structured matrices, AX k amounts 
to a few matrix-vector products (see [8], [9], [10]). Finally, as only the product of A and some 
vectors is required, we do not need to form A, which is convenient in some applications. 

In operations counts, we use the term “flop” to mean a multiply-add pair. For unstructured 
mxn matrices, each iteration step (3.1) essentially amounts to two matrix multiplications, that is 
to 1.5mn 2 flops, exploiting the symmetry of AX k ; in Section 5 we show how to reduce this to 
about mn 2 + '/in 3 in a practical implementation. 


4. Convergence Acceleration by Scaling 
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Schreiber [14] presented the scaled iteration 

X k+ , = a k+ i(2I-X k A)X kt k = 0, 1, • • ■ (4.1) 

where Xo is given by (3.2). Here, we employ an acceleration parameter Ok+i e [1,2] chosen so as 

to minimize a bound on the maximum distance of any nonzero singular value of X k+ i A from one. 

Let us assume that we know bounds a,™ and o max on the singular values of A satisfying 


®min — — Omn- 

In this case, every eigenvalue of A T A lies in the interval [cr m i n ,a mix ]. (We assume this interval has 
nonzero length; otherwise A is a scalar multiple of an orthogonal matrix and Xo is its inverse.) 
We then choose Xq according to (3.2) with 


(Xq = = . 

Ctmin ^ 

It follows from (2.1), (3.2), (4.1) and (4.2) that for all j and k. 


p/Q) = ISl 


®min ^ 


and 


(4.2) 


p J (k+1) = otk + i(2 - pj ck) )pj <k >- 

To determine the acceleration parameters a*, for k > 0, we let 


£ (0) = CtoOmin = 


2o„ 


Omin ^ Omu 


(4.3) 


and 


P (0) = ao<W = 


2o„ 


®min 4" Omax 

these being lower and upper bounds on the singular values of XoA. Then take 


(4.4) 
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(Xlr+1 = pG C+1 ) = 

^ +1 H l+<2-fi0c) )e (k) ’ 

which is both an acceleration parameter and an upper bound on {p/ k+1 ^}, and 


(4.5) 


fifr+i) = a k+ i(2-fi< k ))gf k ) , (4.6) 

which is a lower bound on {p/ k+1) }. The definitions (4.3)-(4.6) imply that for all 1 < j < r and 

k£ 1, 

g(k) ^ pflO £ p(k) f 

and 

g® = 2 — p® . (4.7) 

(Note that (4.7) follows from immediately (4.5) and (4.6). The upper bound pf® < p w is likewise 

straightforward. Finally, if 

0< e 0c)^ptk)^( 2 _ fi (k)) t 

then 

(2-e«)e« < (2-p j < k >)p/ k ) < l, 

whence, by (4.6) and the definition of p /*>, the lower bound gW pjW follows.) 

Except for the last few iterations before conveigence, g w c 1, which implies that Ok+i = 2. 
Thus, p r (k+1) = 4p r W. Therefore, 

- log 4 p r (0) = logztWA) 2 + 1) 1/2 ] = log 2 K(A) + 0(1 /k(A) 2 ) 
steps suffice to bring all the singular values of X k A up to l A, which is half as many as for the 


unaccelerated version. 
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We shall now derive a theorem concerning the optimality of the acceleration parameters 
given by (4.5). Let the symmetric residual matrix initially be 

EsI- o£oA t A = I - XoA. (4.8) 

It is straightforward to show that Newton’s method, starting with Xo = OoA T , produces iterates 

satisfying 

X k A = I - E m (4.9) 

where m = 2 k . For a nonsingular matrix A and for the choice (4.2) for oto, the eigenvalues of E lie 

in the open interval (-1,1), and we have that E m 0 and, therefore, X k A —> I as k °° . 
Furthermore, (4.8) and (4.9) imply that 
X k = (I + E+ +E m - , )a 0 A T . 

Thus, Newton’s method is related to the Neumann series expansion 
(I-E) -1 = I + E + E 2 + ••• 

We therefore ask whether the accelerated method (3.2), (4.1) — (4.6) is related to a better 
polynomial approximation to (ME) -1 . In fact, it is exactly equivalent to approximation of this 
inverse by a Tchebychev polynomial in E, as we now show. 

Let 

T 2 »(Q = cos (2 k cos -1 0 

be the Tchebychev polynomial of degree 2 k on (-1,1). Recall that To(0 = 1, T i(Q = £. and 


T 2 ».i(Q s 2 T 2 »(0 - 1. 


(4.10) 
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The scaled Tchebychev polynomials on (0,^,0 are defined by 


tk(0 = 


Ty(VC + 5) 
T 2 »(5) 


where y= j- — — r- and 5 = ma * + ^ Tmr ^ . Surely, t* is a polynomial of degree 2 k , and 

l '-'max '-'mini fOmax — <3 mini 

tk(0) = 1 , so that 


tk(0 = i-Ct k (0 

for some polynomial Ik of degree 2 k -l; furthermore, it is a classical result that among all such 
polynomials, t k minimizes the norm sup I t k (Q I h 1 1 t k l They also satisfy a 

recurrence like (4.10). Let p k s T 2k (8). Then by (4.10), 

Pktk(C) = 2(p k _it k _i(0) 2 - 1. (4.1 1) 

Theorem 4.1. Let the sequence of matrices X k , k=0, 1, ■ • • be generated by (3.2), (4.1) — 

(4.6). Then 

X k = p k (A T A)A T (4.12) 

where pk(Q is a polynomial of degree 2 k — 1. The polynomial 1 - £ j\(£) is the scaled Tchebychev 

polynomial tk(Q of degree 2 k on (cr min ,c mix ). 

Remark. Before proving the theorem, we point out that (4.12) implies that 

I ~ X k A = I - p k (A T A), 
and therefore that 


I 1 1 - X k A I 1 2 = max 1 1 - pi^Oj 2 ) I , 
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which shows the relevance of the theorem’s conclusion. 

An analogue of this result also holds for Xo any matrix of the form r(A T A)A T , with r a poly- 
nomial, such that the 2-norm of I - XoA is less than one. 

Proof. The claim (4.12) is clearly true when k=0, with po(Q s cto- With the choice (4.2), 

1 - £po(0 = 1-2 C / (cw+Omm) = to(C) is the appropriate scaled Tchebychev polynomial. 

Now use induction on k. A straightforward calculation using (4.5) and (4.7) shows that, for 
all k £ 1, l<tXk<2 and (J k = j , or equivalently that a* = ^ •pj~ • h als° follows from (4.1) 

and the inductive hypothesis in the form (4.12), after multiplying by A, that (4.12) holds for the 
polynomial 

Pk(C) = a k (2-Cp k -i(0)Pk-i(0- 

From this and the relations between a* and Pk we derive the recurrence 
Pkd-CPk) = 2Pk 2 -i(l-CPic-i) 2 -l- 

Thus, Pk(l - CPw) satisfies the Tchebychev recurrence (4.11), which proves the theorem. 

QED 

5. Convergence Acceleration with Cubic Polynomials 

In Section 3 we saw that with the unaccelerated Newton method, the convergence of pf® to 
one is slow for all j such that p/ 0) lies near zero or two. For many input matrices, and for 
moderately large k, the set {p/^J/Lo produced by Newton’s method without acceleration consists 
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of two clusters, one lying near zero and the other near one. In this section, we present an alterna- 
tive to the acceleration method of Section 4 that, in the case of such a large gap in the spectrum of 
X k A, results in much faster convergence. 

Let X be a fixed matrix satisfying XA = VRV T , R = diag(pi, ■ • • ,pn) where 0 < pj < 2 for 
all j. We seek an improved approximation Xj to A + of the form 

X, = (Y 3 (XA) 2 + y 2 XA+y 1 DX. 

We choose { 71 , 72 . 73 ) so that the cubic polynomial c(p) = yip + YzP 2 + "ftp 3 satisfies 
(0 c(l)=l, 

(ii) c'(l) = 0, 

(iii) c(0) = 0, 

(iv) c'(0) > 2. 

The idea here is that small singular values are amplified by the factor c'(0) while those near 
one continue to converge. It is quite evident, however, that c(p) will take large values for some 
p e (0,1), so we must exercise caution. 

We begin by finding p > 0 such that we are certain that there are no eigenvalues pj in 
(p,l-p) or in (l+p,2]. Let T = XA and compute T 2 and 5s I IT-T 2 ! Ip. Now, it follows from 
(3.3) that 


5 2 =.f; [pj(l-pj)] 2 ; 


(5.1) 
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whence, for all 1 £ j £ n, 

Pj ll-Pjl<5. 

If 5 £ Vi, this provides us with no useful information. If, on the contrary, 0 < 5 < Vi, then we con- 
clude that all the eigenvalues pj lie in the two closed intervals: [0,p] and [1 -£.l + PL where 


0 < a = Vi - VVi - 6 < Vi , 
Vi< \-si= i /i + 'lW^5, 


and 


i^i + p=vi+ ■'/Vi + 5 < i + a ■ 

(See Figure 1). 



Figure 1. The intervals [0,e] and [1 - a* 1 + Pl contain all the pj when 
6=11 XA — (XAJ 2 1 I F < Vi. 

Thus, for j = 1, ,n. 


Pj e [o,e]u[i-a,i+p]. 


(5.2) 
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in other words pj is in neither (£.1 -g) nor (l+p,2]. Now we show how to choose c(p) so as to 
satisfy the criteria (i)-(iv), as well as the equally important criteria 

(v) c: [0,g] — » [0,1], 

(vi) c: [I— g,l+p) -» [1,1+g). 

To determine c, we enforce (i)-(iii) together with 

(vii) c(g)=l. 

The unique solution is 

c(p) = j(P 2 - (2+fi)p + (l+2g))p, 
which may be rewritten as 

c(p) = ^-((P-I) 3 + ( 1 — £t)(p — 1 ) 2 ) + 1. 

Theorem 5.1. Let 0 < g < V4. Ifc(p) is the unique cubic satisfying (i), (ii), (iii), and (vii) above, 
then (v) and (vi) hold. 

Proof: To show that (v) holds, we recall that a nonvanishing cubic polynomial c(p) has at most 
two critical points (where c'(p) = 0). One of these is p = 1 (condition (ii)). By (i), (vii) and 
Rolle’s theorem, the other one is in (g,l), so c(p) is monotone on [0,g] and must therefore map 
[0,£] onto [0,1]. 

To establish (vi), note that 

c(p) = l + (P~ ] ) 2 (p - g). 
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Now let p = 1 — e with I e I <g<l/2. Then c(p) — 1 = e 2 (l — £ - g) / g, and since the second factor 
is bounded by 0 and 1, the right-hand side is bounded by 0 and g, establishing (vi). 

QED 

Thus, if 5 ^ 14 we cannot accelerate. In this case, we let Xj = (21 — T)X and proceed to the next 
iteration (i.e., Xi is the result of a Newton step (3.1)). On the other hand, if 5 < l A we compute 
G := Vi - Vi/ 4 - 6 and let Xj = i-(T 2 - (2+p)T + (l+2p)I)X (i.e., X, is the result of a cubic step). 
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Here is a practical algorithm incorporating this idea for computing A + . 

ALGORITHM CUINV. 

INPUT: A 
OUTPUT: A + 

STEP 0 [INITIALIZE]: 

Choose X = otoA T , with ao given by (4.2) or (3.7); 

T := XA; 

T2 VALID := false; 

STEP 1 [NEWTON STEP]: 

if X is sufficiently close to A + then retum OO: 

X := (2I-T)X; 
if T2 VALID then 
T := 2T-T 2 
else 

T := XA; 

endif 

T2 VALID := false; 
if trace (T) >n- l A then 
goto STEP 1; 

STEP 2 [TEST FOR SMALL CHANGE]: 

T 2 :=T 2 ; 

5 := I IT-T 2 I I f ; 
if 5 £ l A then 

T2 VALID := true; 
goto STEP 1; 
else 

goto STEP 3; 

STEP 3 [USE ACCELERATION]: 
e := V4 - Vi/ 4 - 5 ; 

X := i-(TH2+a)T+d+2e)I) X; 

T := XA; 
goto STEP 1; 


COMMENTS on ALGORITHM CUINV. 

(1) Stopping criteria are discussed by Soderstrom and Stewart [17]. 

(2) First , we have made use of the fact that after a Newton step (3.1), 


Tic+i sX k+ iA 

= (2I-X k A)X k A 
= (2I-T k )T k . 
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Thus, if we decide to reject the use of a cubic acceleration step, the computation of T 2 in 
STEP 2 is not wasted, because it saves us at least that much work in the following Newton 
step (STEP 1). 

(3) We do not use cubic steps exclusively; we do need at least one Newton step for each cubic 
step because without Newton steps, we cannot find intervals [0,p] and [1 -f),l + £] known 
to contain all the eigenvalues. Furthermore, the Newton steps finish the job of forcing 
eigenvalues near one to actually converge. 

(4) We use only Newton steps when all the singular values of XA are close to one. This is sig- 
naled by the convergence of trace(XA) to within one half of its limiting value of n. 

If A is an mxn matrix, then we assume that the cost of computing the product XA is Vimn 2 
flops (we exploit the symmetry of XA), the cost of computing the product T 2 is Vin 3 flops, and the 
cost of computing X in STEP 1 or STEP 3 is mn 2 flops. Then, if we skip STEP 3, an iteration of 
CUINV costs 

nfo + '/in 3 flops, 

assuming that T2 VALID was true. The cost of an iteration in which we take STEPS 1 — 3 is 
3mn 2 + '/in 3 , 

assuming that T2VALID was false. These assumptions are warranted since it is the iterations that 
take STEP 3 that cause T2 VALID to be false. Thus, for n = m, we pay a premium of only about 
16% compared with two Newton steps. 

What is the effect of a full iteration assuming STEP 3 is taken? Formally, the eigenvalues 
of XA are mapped as follows: 

p -» c((2-p)p) = l+i!=£)-(l-p) 4 - -I(l-p) 6 h C(p;e). (5.3) 

For small p, C(p;p) = (4+-^-)p. And because it will often be the case that pci, the combined 

iteration greatly amplifies the small eigenvalues. Those near one continue to converge super- 
lineariy; equation (5.3) shows that 

1 1-C(p;e) I = (1-P) 4 + 0((l-p) 6 ) 
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^ d-fiXl-p) 3 + 0((l-p) 6 ) = 0((l-p) 3 ) . 

since any eigenvalue p of XA exceeding p must be in [1 - p,l + g] , so g > 1-p. 

See Section 10 for some experimental evidence of the efficiency and reliability of ALGO- 
RITHM CUINV. 


6. An Improved Initial Iterate in the Symmetric Positive Definite Case 

Let us consider the case of a symmetric positive definite matrix A, which is important in 
many applications [6]. Given a matrix A with the SVD 

A = VIV T , I = diag(Oi£ ••• £o n >0) (6.1) 

(which is its eigendecomposition, too) we shall choose 

Xo = pi + aoA (6.2) 

so as to optimally place the singular values of XoA. From (6.1) and (6.2) 

XoA = V(pZ + OoI 2 )V T . 

We choose (P.cxo) so that, with p(a) * Pa + ctoo 2 , the largest possible initial error 
1 1 1 - XoA 1 1 2 h max I p(a) - 1 1 

o.SoSo, r 

is minimized. By standard arguments (see [5], ch. 9), 1 - p is a scaled Tchebychev polynomial 
on (a„,Oi): 


1 - p(o) = T 2 (o) = 2(0 °™ d)2 ~ 52 
2ai£id - S 2 

where Omid = ^(o n + a0 and 8 = aj - amid- Moreover, 

I II-XoAl 1 2 = max IT 2 (o)l =T 2 (o,)= - y . 

a - SaSo > Zo^id-S 2 

We are most concerned about the smallest singular value of XqA, which is 


Let to = 8/ Omid < 1. The condition number k is given by k = Thus 


„ _ Omid + 5 _ 1 +co . 

Omid -6 1 -0) ’ 
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whence 


(0 = 


K— 1 

k+T 


Thus, the smallest initial singular value of XqA is 


p(o n ) = 


2(1 - to 2 ) 

2 - G) 2 

2((K+1) 2 -(1C-Ij 2 l 

2(k+1) 2 -(k-1) 2 

8k 

k 2 + 6k + 1 


(6.3) 


This asymptotically is 8k -1 + 0(k -2 ), which is far larger than the 2k 2 provided by the 
“optimal” choice Xo = otoA (see (3.6))! To find (3, do, and hence Xq, we use the equations 


(Jo + OoO 2 = p(o) = 1 - T 2 (o) 

_ 2CT 2 ld - S 2 - 2(o - a m jd) 2 + S 2 

20md - S 2 
_ -2q 2 + 4qo m id 
2qmid - S 2 

Hence 


q _ 4q mid 
P "2q 2 id -6 2 ’ 


OtoH- 


2q^id-5 2 ‘ 


(6.4) 


Remark. Even without computing estimates or bounds for the extreme singular values we 
may improve the initial approximation when A is symmetric positive definite by choosing 
Xo = (l / I I A I If) I; forinthis case I I I-XqA I l 2 ^ 1 - 1 /(n m K). 


7. Suppressing the Smaller Singular Values 

In this section, given a matrix A and a positive scalar e, we show how to compute A(e) and 
A+(e) , where A(e) is obtained from A by suppressing (that is, setting to zero) all the singular 
values of A that do not exceed e . (As noted in Section 2 above, A(e) is a closest approximation 
to A by matrices whose rank is that of A(e). In practice, solving least squares problems often 
requires the use of this form of regularization; for when A is very badly conditioned, its general- 
ized inverse is largely unknowable due to perturbations in A, but the reduced-rank generalized 
inverses are much better conditioned. This idea is discussed more folly by Golub and Van Loan 
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[6, Section 5.5].) 

In order to compute these rank-reduced generalized inverses, we require a polynomial c(p) 
that gives fast convergence to 0 near p = 0 and to 1 near p = 1. Consider first the iteration 

Xk+i/2 = (21— X k A)X k , X k+ i = X k+ i/2AX k+ i/2 (7.1) 

that was proposed by the second author [14]. The eigenvalues of X k A satisfy the equations 

Pj f +1) = ((2-p/ k) )p/ k) ) 2 for all j and k. Figure 2 illustrates the effect of this mapping. The fixed 
points are p 0 = 0, p! = (3-V5)/2 = .3819 ••• , p 2 = 1 , and p 3 = (3 + V5)/2 = 2.618 ••• which 
are the roots of the quartic (2-p) 2 p 2 = p. 



Figure 2. The mapping (2 - p) 2 p 2 . 

Consider especially the intervals {p: 0 < p < pi) and {p: pi <p < 2— pj); note that 
2-pi = (1+^5) /2 = 1.618 ••• Evidently, the eigenvalues from the former interval are sent 
towards zero and from the latter interval towards one. The convergence is ultimately quadratic 
but is slow near pi and 2— pj . To compute A + (e), it suffices to apply the iteration (4.1) — (4.6) 
with appropriate Cnun and o max until the following relations hold: 
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0 £ pjM < p i if and only if Oj < e, (7- 2 ) 

Pi < p/ k) < 2-pi otherwise. &•?>) 

We can satisfy (7.2) for k=0 by setting cto = pi / e 2 , but then (7.3) may not hold. Therefore, we 

proceed as follows: 


0) Compute an upper bound 0 ^ on a?. 

1) Set ao = min {2 / (a m „ + e 2 ), p, / e 2 }. This insures that the eigenvalues p/°> (of XoA) 
corresponding to the singular values Oj t e (of A) are all closer to one than the smaller 
eigenvalues. Set p (0) = o<oa m „, f> (0) = cto£ 2 . 


2 ) 

3) 

4) 


Apply the iteration (4.1) with parameters given by (4.5) and (4.6) until ^ Pi- 


Set X k := 


Pi 




00 


X k 


Apply the iteration (7.1) until the matrix A + (e) has been computed with the desired 


accuracy. 

The scaling at Step (3) is done to insure that all the small singular values (less than e ) are in fact 


suppressed. 

The iteration (7.1) associated with the quartic polynomial (p(2-p)) 2 is not the most efficient 
way of computing A + (e). The same objective can be achieved by using the iteration 


X<k+D = (-2XWA+3I)X«AX« ( 7 - 4 ) 

associated with the cubic polynomial C(p) = -2p 3 +3p 2 (see Figure 3). Note that C(l)=l, 
C(0) = C'(0) = C'(l) = 0, C(l/2) = 1/2, so that the mapping £(p) has three nonnegative fixed points 
0, 1/2 and 1. Let p 4 = (l+^3)/2 = 1.366 • * • be the unique solution to -2p| + 3p| = 1/2 greater 
than one. The eigenvalues of X^A in the interval (p: 0 S p < 1/2) are sent towards zero, and the 
eigenvalues in the interval {p: 1/2 < p ^ p 4 ) are sent towards one; the convergence to zero and 
one is ultimately quadratic but is slow near 1/2 and p 4 . 
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Figure3. The mapping -2p 3 + 3p 2 . 

The iteration (7.4) is simpler than (7.1): it requires three matrix multiplications, one less 
than is needed in (7.1). 

Remark 7.1. We may also compute A(e) itself since 

A(e) = AA + (e)A . (7.5) 

Remark 12 . We may use either of the methods (7.1) or (7.4) to "split” a matrix A into 
two better-conditioned matrices A(e) and A(e) such that 

A = A(e) + A(e) . 

Moreover, if e is placed where there is a large gap in the singular values of A, then faster conver- 
gence is possible. For A + may be computed by the formula 


A + = A + (e) + A + (e) . 

The matrices A(e) and A(e) may be much better conditioned than A. 
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8. Stability of the Basic and Modified Iterations 

It is well-known that Newton iteration (3.1) is numerically stable and even self-correcting if 
the input matrix A is nonsingular. If A is singular, however, then it is very mildly unstable. 

Let A and A + have the SVDs 


and 



V T , 



X k+ i=X k + (I-X k A)X k 

I + E n 
~ E21 2E22 ‘ 

Due to the block 2 E 22 , the iteration (3.1) is mildly unstable if A is singular (in which case the 
(2,2) block above is not empty). After 21og 2 K(A) iterations, rounding errors of order k^A) can 
accumulate. In Figure 4, this phenomenon is illustrated. Here, A was 6x6 with condition number 
30 and rank four. The method converges in 17 iterations. All logarithms are base 10 and the Fro- 
benius norm is used. Note that the norm of the off-diagonal part of VXV T grows exponentially. It 
reaches a value about four orders of magnitude above machine precision (which is roughly 
10 -16 ); a loss of four digits of precision results. 
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Figure 4. Mild Instability of Newton Iterations in the Rank-Deficient Case. 

On the other hand, the iteration (7.1) is stable even for singular A. Indeed, by (7.1) and 


( 8 . 1 ) 


X(k+l/2) - V 


r* Eh 

E21 2E22 


UT, 


hence 


I E 12 
E21Z 2E22 


X (k+1/2) A _ V 

and therefore, 

X(k+D = x<k+i«>A Xfr* 1 ' 2 ) 


V T , 


= V 


E+ E 12 
E21 0 


U T . 


Thus we deduce that the iteration (7.1) is stable for any matrix A. Similarly , we deduce that the 
iteration (7.4) is stable for any matrix A. Thus, the methods of Section 7 have the dual advantage 
of stability and a well-conditioned solution, in contrast to the use of Newton iteration (or its 
accelerated form) on A. (If one wishes to compute A + , then the instability of Newton’s method 
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can be partly removed by using a few iterations (7.1) or (7.4) after all the significant singular 
values have converged. Some practical details of this technique are discussed in Section 10.) 


9. Computing the Projection onto a Subspace Spanned by Singular Vectors 

In this section we discuss a modification of the Newton iterations discussed above that 
allows us to compute the orthogonal projection matrices onto subspaces spanned by the singular 
vectors corresponding to either the dominant or the smallest singular values at less expense than 
computing the generalized inverse of A. Important applications to spectral estimation and direc- 
tion finding with antenna arrays were developed by Schmidt [15]. 

Let us define the matrices 


and 


P(e) = AA + (e) = U 


I 0 
00 


U T 


(9.1) 


P*(e) = A + (e)A = V 



(9.2) 


where the matrices U and V are from (2.1), I is the r(e)xr(e) identity block, and r(e) is the number 
of the singular values of A that are not less than e. Then P(e) and P* (e) are the orthogonal projec- 
tions onto the subspaces spanned by the first r(e) columns of the matrices U and V, respectively. 
Our previous results already give us some iterative algorithms for computing 
A(e) = (A + (e)) + = AA(e) + A, as well as P(e) and P*(e), but there are simpler and more efficient 
algorithms that we shall give shortly. In addition to the signal processing application mentioned 
above, we may use this technique as an alternative to the methods of Section 7 for computing 
A(e), since 


A(e) = P(e)A = AP*(e) . (9.3) 

The following iteration extends (7.4) and converges to P(e), unless e is a singular value of 

A. Let 


P 0 = aoAA T +pI, 

where we choose oto > 0, P > 0, to satisfy 


(9-4) 
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Oo£ 2 + p = 1/2; aoof + p<p 4 = 1.37 ••• . (9.5) 

then iterate as follows: 

P k+1 = (-2P k +3I)P^ = (I-2(P k -Q)Pj? k=0, 1, ••• (9.6) 

The convergence of P k to P(e) immediately follows from the considerations of Section 7. 

The iteration (9.4)-(9.6) converges to P*(e) if we replace AA T by A T A in (9.4). Further- 
more, we may compute A(e) by using (9.3), which is superior to the solution given in Section 7 
because we now need to compute a single generalized inverse (rather than two). Also, each itera- 
tion step (9.6) only involves two matrix multiplications. And finally, if A is rectangular then one 
of these iterations is less expensive than (7.4) (for example) because it involves smaller sym- 
metric matrices. 

Remark 9.1. The stability analysis of Section 8 can be immediately extended to the itera- 
tion (9.6). 



Remark 92 . We may accelerate the cubic iteration for P(e) as follows. At the early stages 
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of the iteration, it is more important to move singular values away from 0.5. After a step 
p = 3P 2 - 2P 3 , we have that the spectrum of P lies in the closed interval [0,1]. We then replace P 
by aP + pi where ap + P is a line that maps [0,1] into ( - (£4 - l),p 4 )- To get the best possible 
speedup by this method, we choose such a line and also require that a ( l A) + P = Vi 

Remark 9 J. Our ability to compute the projectors P(e) allows us, with a little additional 
computation, to do the following: 

(1) The projector P(e) defines the rank of the matrices A(e) and A + (e), for 

rank A(e) = rank A + (e) = 1 1 P(e) 1 1 f = trace(P(e)) . 

This observation may be used as the basis of a bisection strategy for computing the singular 
values of A in polylog time. Indeed, the singular values of A are those of A(e) together 
with those of A - A(e). We may in this way reduce the problem to that of computing the 
positive singular value of a matrix with only one positive singular value. This we discuss in 
point (4). It is straightforward to develop a similar polylog algorithm for the eigenvalues of 
any symmetric matrix. 

(2) We may compute projectors P(£i,£ 2 ) onto subspaces spanned by singular vectors belonging 
to all the singular values in [ei,E 2 ), since P(ei,£ 2 ) = P(ei) - P(£ 2 )- 

(3) We may determine easily, for a given vector x, whether or not x e S(e) where S(e) is the 
span of the singular vectors corresponding to singular values greater than or equal to e, by 
checking whether or not x = P(e)x. 

(4) We may rapidly compute any singular value, regardless of multiplicity, as soon as we have 
found an interval [e^] that contains this singular value, a, and no other. For a is the only 
singular value of A(e I ,e 2 ) = (P(ei) -P(e 2 ))A. Its multiplicity k is given by 
trace(P(ei) - P(£ 2 ». And a 2 = trace(A T (ei ,£ 2 ) A(£ t , e 2 )) / k. 

10. Experimental Results 

We generated a random 64x64 matrix, then changed its singular values so as to create an 
ill-conditioned matrix A whose singular values lie in two clusters. There are 32 singular values 
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in the interval [1, 7.6] and 32 others in the interval [lO" 7 ,!©" 6 ]. 


iter 

trace(XA) 

1 1 X A - 1 1 1 

Q. 

0 

8.6801e-01 

1.0000e+00 

0 

1 

1.6882e+00 

1.0000e+00 

0 

2 

3.1987e+00 

1.0000e+00 

0 

3 

5.7784e+00 

1.0000e+00 

0 

4 

9.6436e+00 

1.0000e+00 

0 

5 

1.4398e+01 

1.0000e+00 

0 

6 

1.9113e+01 

1.0000e+00 

0 

7 

2.3340e+01 

1.0000e+00 

0 

8 

2.7108e+01 

1.0000e+00 

0 

9 

3.0015e+01 

1.0000e+00 

0 

10 

3.2111e+01 

1.0000e+00 

2.048 le-01 

11 

3.2003e+01 

1.0000e+00 

3.033 le-03 

12 

3.2014e+01 

9.9998e-01 

8.4572e-06 

13 

3.5992e+01 

9.9357e-01 

7.0355e-03 

14 

3.8974e+01 

9.8718e-01 

0 

15 

4.3075e+01 

9.7452e-01 

0 

16 

4.7663e+01 

9.4970e-01 

0 

17 

5.2046e+01 

9.0192e-01 

0 

18 

5.5954e+01 

8.1346e-01 

0 

19 

5.9225e+01 

6.6172e-01 

0 

20 

6.1749e+01 

4.3787e-01 

0 

21 

6.3309e+01 

1.9173e-01 

0 

22 

6.3906e+01 

3.6761e-02 

0 

23 

6.3997e+01 

1.3514e-03 

0 

24 

6.4000e+01 

1.8267e-06 

0 

25 

6.4000e+01 

7.6424e-09 

0 


Table 1 : Results for First Test Matrix 


We used algorithm CUINV to compute A -1 . The initial iterate was that of (3.2) and (3.7). The 
results are shown in Figure 6; the data are in Table 1, which gives trace(X|<A), 1 1 X^A - 1 1 1 2 , and 
the computed bound £ for k=0, 1, ••• ,12. Where p = 0 the algorithm skipped the cubic 
acceleration step. Early on this is because 5 is too large, as the first 32 large singular values con- 
verge. By way of comparison, Newton’s method takes 60 iterations to obtain the solution pro- 
duced by CUINV in 25. 
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trace(XA) log ilXA - Dl 



iteration iteration 

Figure 6. Convergence history for first tea problem. 

Next, we repeated this experiment with a second random matrix of order 64, having all its 

singular values in [0.066, 1]. The method chose unaccelerated Newton steps (19 are required) 
every time. Cubic steps were never used; this was due to the absence of any gap in spectrum of 
XkA. 

We therefore modified the algorithm so that when cubic acceleration is ruled out it uses a 
step of adaptive Tchebychev acceleration as discussed in Section 4. For die practical application 
of this idea, we require a means of finding a bound p*>0 such that pP^<p*. We then use as the 
acceleration parameter 



Hiis insures that the acceleration process does not cause a large singular value to be mapped to 
the left of the smallest, thereby making the problem more difficult Assume that the test 5 < 14 at 
STEP 2 of CUINV fails. Let 
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5 = 5 /^. 

Then, if § < V*, we take 

p» = x h — ^Va — 8 
Now, since (5.1) holds, 

Now it must be the case that p r (the smallest positive singular value of X k A) is less than p. or else 
all of the singular values are in (1 - p*, 1], To rule out the latter possibility, we check whether 

trace(XkA) £ n (1 - p.) 
and, if not, we accelerate. 

With this change, the number of iterations required for convergence dropped to 13 for this 
problem; the operation count went from 20.5e6 to 14.6e6. 

Evidently, even for well-conditioned matrices, modified CLJINV can be far more efficient 
than Newton’s method; for moderately ill-conditioned matrices, the differences are pronounced. 

Next, we generated a random 64x64 matrix, then changed its singular values so as to create 
an ill-conditioned matrix A with 54 singular values in [10" ,6 ,10" n ] and the remaining 10 in 
[0.01, 1]. We computed the generalized inverse of A(l.e-10), the matrix of rank 10 obtained by 
suppressing the small singular values of A. We first used algorithm CUINV with A, choosing 
oto= 1. In addition we monitored the growth of e which was initially set equal to e^lO -20 and 
which is updated according to the recursions: 

e := (2 - e)e 
in Step 1 and 

e := -±-(e 2 -(2 + a)e + (l+2e))e 

in Step 3. Thus, e separates the singular values of XA that we wish to suppress from those that 
we wish to map to one. At the first iteration in which the parameter g, computed in Step 3, is less 
than £, we stop and switch to the iteration (7.4) for, since (5.2) holds, the unwanted singular 
values of XA must now be smaller than x h. In this example, the switch occurred after the seven- 
teenth iteration. Table 2 gives the results; see also Figure 7. 



log fl XA(e) - pmv(A(e))A(e)ll 


aeration 


iteration 


Figure 7. Convergence history for second test problem. 


iter 


1 1 XA(e) - A(e) + A 1 1 

e 

& 

1 

3.9667e-01 

9.9997e-01 

1.0000e-20 

0 

2 

7.2843e-01 

9.9995e-01 

2.0000e-20 

0 

3 

1.2410e+00 

9.9989e-01 

4.0000e-20 

0 

4 

1.8730e+00 

9.9978e-01 

8.0000e-20 

0 

5 

2.436 le+00 

9.9957e-01 

1.6000e-19 

0 

6 

2.8749e+00 

9.9913e-01 

32000e-19 

0 

7 

4J728e+00 

9.9271e-01 

2.6999C-18 

4J074e-01 

8 

5.3217e+00 

9.8548e-01 

5.3998e-18 

0 

9 

6.0588e+00 

9.7117e-01 

1.0800e-17 

0 

10 

6.7139e+00 

9.4318e-01 

2.1599e-17 

0 

11 

7.3764e+00 

8.8958e-01 

4.3198e-17 

0 

12 

8.1084e+00 

7.9136e-01 

8.6396e-17 

0 

13 

8.795 8e+00 

6.2625e-01 

1.7279e-16 

0 

14 

9.3559e+00 

3.9219e-01 

3.4558e-16 

0 

15 

1.0110e+01 

9.2682e-02 

5.3991e-15 

1.7207e-01 

16 

1.0008e+01 

8.4375e-03 

1.2779e-12 

8^950e-03 

17 

1.0000e+01 

7.1182e-05 

3J906e-08 

7.1192e-05 

18 

1.0000e+01 

3.0452e-ll 


... 

19 

1.0000e+01 

3.0452e-ll 

• •• 

... 


Table 2: Computing A(e) + 

The computed generalized inverse is accurate to about 11 digits. This is somewhat fewer than we 
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could wish, given that the condition number of this problem was 100. The loss of accuracy is due 
to the fact that the accelerated Newton process (CUINV) went “too far”, (raising e by 12 orders 
of magnitude) before detecting the possibility of switching to the stable procedure (7.4). 

11. Discussion 

It has been our purpose here to clarify and illustrate the potential for the use of variants of 
Newton’s method to solve problems of practical interest on highly parallel computers. We have 
shown how to accelerate the method substantially. We have shown how to modify it to success- 
fully cope with ill-conditioned matrices. We have developed practical implementations. We con- 
clude that Newton’s method can be of value for some interesting computations, especially in 
parallel and other computing environments in which matrix products are especially easy to work 


with. 
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